AITopics | momentum sgd

In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $β$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.

artificial intelligence, machine learning, provable suboptimality, (14 more...)

arXiv.org Machine Learning

2601.12238

Country:

Asia > Middle East > UAE (0.04)
North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Neural Information Processing SystemsDec-25-2025, 15:37:28 GMT

Communication overhead is a major bottleneck hampering the scalability of distributed machine learning systems. Recently, there has been a surge of interest in using gradient compression to improve the communication efficiency of distributed neural network training. Using 1-bit quantization, signSGD with majority vote achieves a 32x reduction in communication cost. However, its convergence is based on unrealistic assumptions and can diverge in practice. In this paper, we propose a general distributed compressed SGD with Nesterov's momentum.

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Add feedback

AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate

Zhu, Meng, Xiao, Quan, Min, Weidong

arXiv.org Machine LearningNov-21-2025

Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamNX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to momentum SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamNX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamNX.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

2511.13465

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > China > Jiangxi Province > Nanchang (0.05)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

968c9b4f09cbb7d7925f38aea3484111-Supplemental.pdf

Neural Information Processing SystemsNov-15-2025, 04:57:42 GMT

algorithm, compression error, errorcompensatedx, (11 more...)

Neural Information Processing Systems

Country:

North America > United States > Michigan (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

ErrorCompensatedX: error compensation for variance reduced algorithms

Neural Information Processing SystemsNov-15-2025, 04:57:38 GMT

Communication cost is one major bottleneck for the scalability for distributed learning.

algorithm, compression error, errorcompensatedx, (11 more...)

Neural Information Processing Systems

Country:

North America > United States > Michigan (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.31)

Add feedback

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, Roger B. Grosse

Neural Information Processing SystemsAug-20-2025, 06:18:34 GMT

Neural Information Processing Systems http://nips.cc/

batch size, plain sgd, sgd, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)

Add feedback

Boosted CV aR Classification (Supplementary Material)

Neural Information Processing SystemsAug-17-2025, 01:19:57 GMT

On the COMP AS dataset, we use a three-layer feed-forward neural network activated by ReLU as the classification model. For optimization we use momentum SGD with learning rate 0.01 and The batch size is 128. On the CelebA dataset, we use a ResNet18 as the classification model. The remaining 45000 training samples consist the training set. The batch size is 128.

artificial intelligence, cv ar null 0 1, machine learning, (15 more...)

Neural Information Processing Systems

Country: